A Probabilistic Genome-Wide Gene Reading Frame Sequence Model
Abstract
We introduce a new type of probabilistic sequence model that models the sequential composition of reading frames of genes in a genome. Our approach extends gene finders with a model of the sequential composition of genes at the genome level, effectively producing a sequential genome annotation as output. The model can be used to obtain the most probable genome annotation based on a combination of (i) a gene finder score for each gene candidate and (ii) the sequence of the reading frames of gene candidates through a genome. The model, as well as a higher-order variant, is developed and tested using PRISM, a probabilistic logic programming language and machine learning system that serves as a fast and efficient model prototyping environment, with bacterial gene finding performance as a benchmark of signal strength. The model is used to prune a set of gene predictions from an underlying gene finder and is evaluated by its effect on prediction performance. Since bacterial gene finding is to a large extent a solved problem, it forms an ideal proving ground for evaluating the explicit modeling of larger-scale gene sequence composition of genomes. We conclude that the sequential composition of gene reading frames is a consistent signal present in bacterial genomes and that it can be effectively modeled with probabilistic sequence models.
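To make the idea of combining gene finder scores with the sequence of reading frames concrete, here is a minimal Python sketch. It is not the PRISM model described in the abstract: the function name `prune_predictions`, the first-order transition table `trans`, and the keep/discard scoring are all assumptions chosen for illustration. Each candidate is either kept (contributing its score times a frame-to-frame transition probability) or discarded (contributing one minus its score), and a Viterbi-style dynamic program finds the highest-scoring selection.

```python
import math

EPS = 1e-9  # keeps log() finite for scores of exactly 0 or 1


def prune_predictions(candidates, trans):
    """Pick which gene candidates to keep (illustrative toy scoring).

    candidates: list of (frame, score) pairs ordered by genome position,
                where score is a gene finder confidence in [0, 1].
    trans:      dict mapping (previous_kept_frame, frame) -> probability;
                previous_kept_frame is None before any gene is kept.
    Returns a list of booleans, True where the candidate is kept.
    """
    # best[last_kept_frame] = (log score, keep flags so far)
    best = {None: (0.0, [])}
    for frame, score in candidates:
        s = min(max(score, EPS), 1.0 - EPS)
        nxt = {}
        for last, (logp, flags) in best.items():
            # Option 1: discard the candidate; the last kept frame is unchanged.
            drop = (logp + math.log(1.0 - s), flags + [False])
            if last not in nxt or drop[0] > nxt[last][0]:
                nxt[last] = drop
            # Option 2: keep the candidate; pay the frame transition cost.
            p = trans.get((last, frame), 0.0)
            if p > 0.0:
                keep = (logp + math.log(s) + math.log(p), flags + [True])
                if frame not in nxt or keep[0] > nxt[frame][0]:
                    nxt[frame] = keep
        best = nxt
    return max(best.values(), key=lambda entry: entry[0])[1]


# Hypothetical transition table: six reading frames, with a made-up
# preference for consecutive genes to share a frame.
frames = range(1, 7)
trans = {(None, f): 1.0 / 6 for f in frames}
for a in frames:
    for b in frames:
        trans[(a, b)] = 0.5 if a == b else 0.1

candidates = [(1, 0.9), (1, 0.9), (4, 0.55), (1, 0.9)]
print(prune_predictions(candidates, trans))
```

With these made-up numbers, the low-confidence candidate in frame 4 is the one pruned, because keeping it would require two unlikely frame changes. In a real setting the transition probabilities would be learned from annotated genomes rather than set by hand.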
Similar resources
Detection and Characterization of Weissellicin 110, a Bacteriocin Produced by Weissella cibaria
Background: Weissellicin 110 is the only bacteriocin reported in Weissella cibaria up to now. This bacteriocin represents several unique features. This is the first report on the gene sequence that encodes for the bacteriocin. Objectives: Providing a rapid detection method to isolate the weissellicin 110 encoding gene and determination of the bacteriocin distribution were the objectives. Materi...
Probabilistic methods of identifying genes in prokaryotic genomes: Connections to the HMM theory
In this paper, we review developments in probabilistic methods of gene recognition in prokaryotic genomes with the emphasis on connections to the general theory of hidden Markov models (HMM). We show that the Bayesian method implemented in GeneMark, a frequently used gene-finding tool, can be augmented and reintroduced as a rigorous forward-backward (FB) algorithm for local posterior decoding d...
A Classification Approach to Comparative Gene Finding in Mammals
Evolutionary conservation is a powerful signal that can be used to identify protein-coding genes within related genomes. Promising early approaches ([1], [3], [4], [5]) considered conservation between two species, typically human and mouse, to augment existing ab initio gene finding approaches. The growing availability of genomes from many different species makes it possible to additionally con...
A novel index which precisely derives protein coding regions from cross-species genome alignments.
We introduce here a novel index which precisely derives protein coding regions from cross-species genome alignments. The index is closely related to frame recovery observed in coding sequence alignments, that is, if insertions or deletions of nucleotides cause frame shifts in coding regions, other in-dels which recover the reading frames will often be observed in the vicinity. In contrast, such...
In Silico Genome-Wide Screening for TnrA-Regulated Genes of Bacillus clausii
Bacillus clausii TnrA transcription factor is required for global nitrogen regulation. In order to obtain an overview of gene regulation by TnrA in B. clausii KSMK16, the entire genome of B. clausii was screened for the consensus sequence, 5’-TGTNAN7TNACA-3’ known as the TnrA box, and 13 transcription units were found containing a putative TnrA box. The TnrA targets identified in...